Introduction

This is Zachary Sherker’s hw005 for STAT 545.

In this assignment, I will:

1) Reorder a factor based on the data and demonstrate the effect in arranged data and in figures.
2) Write data to file and load it back into R.
3) Improve a figure through the use of factor levels, smoother mechanics, and color schemes.
4) Convert this to a plotly visual.

# Part I: Factor management

First, I will load the required packages and data sets:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(plotly))  
suppressPackageStartupMessages(library(scales))

Elaborating the gapminder data set

  1. Drop Oceania

I will begin by filtering the Oceania rows out of the gapminder data set:

Oceania_drop <- gapminder %>% 
  filter(continent != "Oceania")
str(Oceania_drop)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

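I will then drop the unused Oceania level, first from continent only with fct_drop() (country keeps all 142 levels), and then from every factor at once with droplevels() (country falls to 140 levels):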

First_drop <- Oceania_drop %>% 
  mutate(continent = fct_drop(continent))
str(First_drop)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
Second_drop <- Oceania_drop %>% 
  droplevels()
str(Second_drop)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
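
The difference between the two approaches is easier to see on a small, made-up data frame. The sketch below is only an illustration: the toy data and its column names are assumptions and are not part of gapminder. fct_drop() is applied to a single factor, while droplevels() cleans every factor in the data frame.

```r
library(forcats)

# Toy data (hypothetical): region has 3 levels, city has 4
toy <- data.frame(
  region = factor(c("North", "North", "South", "West")),
  city   = factor(c("A", "B", "C", "D"))
)

# Remove the rows for one region; its level is still recorded in the factor
toy_filtered <- toy[toy$region != "West", ]
nlevels(toy_filtered$region)  # still 3
nlevels(toy_filtered$city)    # still 4

# fct_drop() only touches the factor it is given
one_factor <- toy_filtered
one_factor$region <- fct_drop(one_factor$region)
nlevels(one_factor$region)    # 2
nlevels(one_factor$city)      # 4, unchanged

# droplevels() drops unused levels from every factor in the data frame
all_factors <- droplevels(toy_filtered)
nlevels(all_factors$region)   # 2
nlevels(all_factors$city)     # 3
```

This mirrors the str() output above, where only droplevels() reduces the number of country levels.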
  2. Reorder the levels of country or continent

The continents in the gapminder dataset are currently ordered alphabetically:

levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

I will first reorder the continents by mean GDP per capita (the mean is marked and labelled in the plot):

gapminder %>% 
  mutate(continent = fct_reorder(continent, gdpPercap, mean)) %>% 
  ggplot(aes(continent, gdpPercap)) +
  geom_violin(aes(fill = continent)) +
  # mark each continent's mean with a point and label it with its value
  # (fun and after_stat() require ggplot2 >= 3.3.0; they replace fun.y and ..y..)
  stat_summary(fun = mean, colour = "green", geom = "point", size = 2, show.legend = TRUE) +
  stat_summary(fun = mean, colour = "purple", geom = "text", size = 4, show.legend = TRUE,
               vjust = -0.7, aes(label = round(after_stat(y), digits = 1)))

I will now reorder the continents by maximum GDP per capita (the maximum is marked and labelled in the plot):

gapminder %>% 
  mutate(continent = fct_reorder(continent, gdpPercap, max)) %>% 
  ggplot(aes(continent, gdpPercap)) +
  geom_violin(aes(fill = continent)) +
  # mark each continent's maximum with a point and label it with its value
  # (fun and after_stat() require ggplot2 >= 3.3.0; they replace fun.y and ..y..)
  stat_summary(fun = max, colour = "green", geom = "point", size = 2, show.legend = TRUE) +
  stat_summary(fun = max, colour = "purple", geom = "text", size = 4, show.legend = TRUE,
               vjust = -0.7, aes(label = round(after_stat(y), digits = 1)))

Finally, I will reorder the continents by minimum GDP per capita (the minimum is marked and labelled in the plot):

gapminder %>% 
  mutate(continent = fct_reorder(continent, gdpPercap, min)) %>% 
  ggplot(aes(continent, gdpPercap)) +
  geom_violin(aes(fill = continent)) +
  # mark each continent's minimum with a point and label it with its value
  # (fun and after_stat() require ggplot2 >= 3.3.0; they replace fun.y and ..y..)
  stat_summary(fun = min, colour = "green", geom = "point", size = 2, show.legend = TRUE) +
  stat_summary(fun = min, colour = "purple", geom = "text", size = 4, show.legend = TRUE,
               vjust = -0.7, aes(label = round(after_stat(y), digits = 1)))
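
The plots show the effect of reordering in figures. To see the effect on the levels themselves and on arranged data, here is a minimal sketch on a toy tibble. The values are made up for illustration and do not come from gapminder.

```r
library(dplyr)
library(forcats)

# Toy data (hypothetical values, not from gapminder)
toy <- tibble(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(10, 30, 1, 2, 5, 100)
)

levels(toy$group)  # alphabetical: "a" "b" "c"

# Reorder the levels by the mean of value within each group
by_mean <- toy %>%
  mutate(group = fct_reorder(group, value, .fun = mean))
levels(by_mean$group)  # "b" "a" "c"

# Reorder by the maximum instead, largest first
by_max_desc <- toy %>%
  mutate(group = fct_reorder(group, value, .fun = max, .desc = TRUE))
levels(by_max_desc$group)  # "c" "a" "b"

# arrange() sorts a factor by its level order, not alphabetically,
# so the rows for "b" now come first
by_mean %>% arrange(group)
```

Note that fct_reorder() uses the median when no summary function is given, which is why the function is passed explicitly here, as it is in the plots above.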

## Part II: File I/O

I start by filtering the data to only show information from the Americas in 2002:

filterdata <- gapminder %>% 
  filter(continent == "Americas" & year == 2002) 

# drop unused levels
AmericasData <- filterdata %>% 
 droplevels() 
# check the levels of continent and country to be sure unused data is dropped.
str(AmericasData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    25 obs. of  6 variables:
##  $ country  : Factor w/ 25 levels "Argentina","Bolivia",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 1 level "Americas": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ lifeExp  : num  74.3 63.9 71 79.8 77.9 ...
##  $ pop      : int  38331121 8445134 179914212 31902268 15497046 41008227 3834934 11226999 8650322 12921234 ...
##  $ gdpPercap: num  8798 3413 8131 33329 10779 ...

write/read csv

I will now write the new dataset out to a CSV file:

write_csv(AmericasData,"AmericasData.csv")

And read it back in from the CSV file:

read_AmeicasData <- read_csv("AmericasData.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_integer(),
##   gdpPercap = col_double()
## )

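I will now check the newly read-in dataset. Note that country and continent are now character columns rather than factors, because a CSV file does not store factor information: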

head(read_AmeicasData)
## # A tibble: 6 x 6
##   country   continent  year lifeExp       pop gdpPercap
##   <chr>     <chr>     <int>   <dbl>     <int>     <dbl>
## 1 Argentina Americas   2002    74.3  38331121     8798.
## 2 Bolivia   Americas   2002    63.9   8445134     3413.
## 3 Brazil    Americas   2002    71.0 179914212     8131.
## 4 Canada    Americas   2002    79.8  31902268    33329.
## 5 Chile     Americas   2002    77.9  15497046    10779.
## 6 Colombia  Americas   2002    71.7  41008227     5755.

save/read rds

# save to RDS file
saveRDS(AmericasData, "AmericasData.rds")

# read from RDS file
read_RDSdata <- readRDS("AmericasData.rds")

# check the read-in data
head(read_RDSdata)
## # A tibble: 6 x 6
##   country   continent  year lifeExp       pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Argentina Americas   2002    74.3  38331121     8798.
## 2 Bolivia   Americas   2002    63.9   8445134     3413.
## 3 Brazil    Americas   2002    71.0 179914212     8131.
## 4 Canada    Americas   2002    79.8  31902268    33329.
## 5 Chile     Americas   2002    77.9  15497046    10779.
## 6 Colombia  Americas   2002    71.7  41008227     5755.

Put data into text file and subsequently read from text file

# put data into text file
dput(AmericasData, "AmericasData.txt")

# retrieve data from text file
data_txt <- dget("AmericasData.txt")
 
head(data_txt) 
## # A tibble: 6 x 6
##   country   continent  year lifeExp       pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Argentina Americas   2002    74.3  38331121     8798.
## 2 Bolivia   Americas   2002    63.9   8445134     3413.
## 3 Brazil    Americas   2002    71.0 179914212     8131.
## 4 Canada    Americas   2002    79.8  31902268    33329.
## 5 Chile     Americas   2002    77.9  15497046    10779.
## 6 Colombia  Americas   2002    71.7  41008227     5755.
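
The three outputs above differ in one respect: the CSV round trip returns character columns, while the RDS and dput/dget round trips return factors. The sketch below repeats the comparison on a small made-up data frame whose factor levels are deliberately not alphabetical. It is only an illustration: the toy data is an assumption, and base R's write.csv()/read.csv() stand in for write_csv()/read_csv() so that the example needs no extra packages.

```r
# Toy data (hypothetical): a factor whose levels are deliberately not alphabetical
toy <- data.frame(
  size  = factor(c("small", "large", "medium"),
                 levels = c("small", "medium", "large")),
  count = c(3L, 1L, 2L)
)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
txt_file <- tempfile(fileext = ".txt")

# CSV: plain text, so the factor comes back as character
write.csv(toy, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

# RDS: binary R format, keeps the object as it was
saveRDS(toy, rds_file)
from_rds <- readRDS(rds_file)

# dput/dget: R code in a text file that rebuilds the object
dput(toy, txt_file)
from_txt <- dget(txt_file)

class(from_csv$size)   # "character"
levels(from_rds$size)  # "small" "medium" "large"
levels(from_txt$size)  # "small" "medium" "large"
```

If the level order matters after a CSV round trip, it has to be restored explicitly, for example with factor(from_csv$size, levels = c("small", "medium", "large")).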

## Part III: Visualization of data

I will start by creating a basic graph comparing the GDP per capita of all countries within continental groupings:

ggplot(gapminder,aes(gdpPercap,continent))+
  geom_line(aes(colour=continent,size=gdpPercap),alpha=0.8)

I will now modify the graph to make it more informative by first reorganizing the data:

# get the min, max and mean GDP per capita for each continent in each year
Reorganized_data <- gapminder %>% 
  group_by(continent, year) %>% 
  summarize(
    min_gdp = min(gdpPercap),
    max_gdp = max(gdpPercap),
    mean_gdp = mean(gdpPercap)
  )
# gather the three summary columns into key/value pairs (long format)
Reorganized_table <- gather(Reorganized_data, key = "Type_GDP", value = "Value_GDP",
                            min_gdp, max_gdp, mean_gdp)
# then check the new gathered table
knitr::kable(head(Reorganized_table))
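
| continent | year | Type_GDP | Value_GDP |
|:----------|-----:|:---------|----------:|
| Africa    | 1952 | min_gdp  |  298.8462 |
| Africa    | 1957 | min_gdp  |  335.9971 |
| Africa    | 1962 | min_gdp  |  355.2032 |
| Africa    | 1967 | min_gdp  |  412.9775 |
| Africa    | 1972 | min_gdp  |  464.0995 |
| Africa    | 1977 | min_gdp  |  502.3197 |
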
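
The reshaping step is what lets a single colour aesthetic distinguish the three statistics. As a minimal sketch on made-up numbers (the continent names and values below are placeholders, not gapminder data), gather() stacks one column per statistic into a key column and a value column. pivot_longer() is shown as the newer tidyr interface for the same reshaping, assuming tidyr 1.0.0 or later is installed.

```r
library(tidyr)

# Toy summary table (hypothetical values): one column per statistic
wide <- data.frame(
  continent = c("X", "Y"),
  min_gdp   = c(1, 2),
  max_gdp   = c(10, 20)
)

# gather() stacks the statistic columns into key/value pairs;
# all min_gdp rows come first, then all max_gdp rows
long <- gather(wide, key = "Type_GDP", value = "Value_GDP", min_gdp, max_gdp)
long

# pivot_longer() does the same reshaping (assumes tidyr >= 1.0.0);
# its rows are ordered by original row rather than by statistic
long2 <- pivot_longer(wide, cols = c(min_gdp, max_gdp),
                      names_to = "Type_GDP", values_to = "Value_GDP")
long2
```

Either form gives one row per continent and statistic, which is the shape ggplot2 needs to map Type_GDP to colour.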
Now I will plot this data:

Reorganized_graph <- Reorganized_table %>% 
  ggplot(aes(x = year, y = Value_GDP, color = Type_GDP)) +
  facet_wrap(~continent) +
  scale_y_log10(labels = dollar_format()) +
  scale_x_continuous() +
  geom_point() +
  geom_line() +
  labs(x = "year",
       y = "GDP per capita (log scale)",
       title = "Minimum, mean and maximum GDP per capita by continent and year") +
  theme(axis.text = element_text(size = 10),
        strip.background = element_rect(fill = "green"),
        panel.background = element_rect(fill = "white"))

Reorganized_graph

As you can see, the reorganized data makes for a much more informative graph, showing the larger trends in a simple format.

(2) Convert graph to plotly

ggplotly(Reorganized_graph)

The plotly version of this graph lets us read the underlying values by simply hovering the mouse over the data points, which makes the graph more informative still.

## Part IV: Writing figures to file

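# name the plot explicitly rather than relying on the last plot displayed
ggsave("modified_graph.png", plot = Reorganized_graph, width = 16, height = 6, units = "cm")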